Frontiers in Artificial Intelligence — Latest Matching Preprints

1

Interpretability as stability under perturbation reveals systematic inconsistencies in feature attribution

Piorkowska, N. J.; Olejnik, A.; Ostromecki, A.; Kuliczkowski, W.; Mysiak, A.; Bil-Lula, I.

2026-04-22 health informatics 10.64898/2026.04.20.26351354 medRxiv

Top 0.1%

6.5%

Interpreting machine learning models typically relies on feature attribution methods that quantify the contribution of individual variables to model predictions. However, it remains unclear whether attribution magnitude reflects the true functional importance of features for model performance. Here, we present a unified interpretability framework integrating permutation-based attribution, feature ablation, and stability under perturbation across multiple feature spaces. Using nested cross-validation and permutation-based null diagnostics, we systematically evaluate the relationship between attribution magnitude and functional dependence in clinical and biomarker-based prediction models. Attribution magnitude is frequently misaligned with functional importance, with weak to strong negative correlations observed across feature spaces (Spearman ρ ranging from -0.374 to -0.917). Features with high attribution often have limited impact on model performance when removed, whereas features with low attribution can be essential for maintaining predictive accuracy. These discrepancies define distinct classes of interpretability failure, including attribution excess and latent dependence. Interpretability further depends on feature space composition, and stable, functionally relevant features are not necessarily those with the highest attribution scores. By integrating attribution, functional impact, and stability into a composite Feature Reliability Score, we identify features that remain informative across perturbations and analytical contexts. These findings indicate that interpretability does not arise from attribution magnitude alone but is better characterized by stability under perturbation. This framework provides a basis for more robust model interpretation and highlights limitations of attribution-centric approaches in high-dimensional and correlated data settings.
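
As an illustration of the kind of comparison this abstract reports, the sketch below computes a Spearman rank correlation between per-feature attribution scores and the performance drop observed when each feature is ablated. The function names and the five-feature numbers are made up for illustration; they are not taken from the preprint, and the authors' actual pipeline (nested cross-validation, null diagnostics, the Feature Reliability Score) is not reproduced here.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: attribution magnitude per feature, and the drop in
# model performance when that feature is removed (ablation impact).
attribution = [0.40, 0.25, 0.20, 0.10, 0.05]
ablation_drop = [0.01, 0.02, 0.00, 0.06, 0.08]
rho = spearman(attribution, ablation_drop)
print(f"Spearman rho = {rho:.3f}")
```

A negative rho in such a comparison means that features ranked highly by attribution tend to matter less when removed, which is the misalignment the abstract describes.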

2

Development of Explainable Machine Learning Framework for Early Detection and Risk Stratification of Diabetes in Age Specific Variations

Lukhele, N.; Mostafa, F.

2026-04-27 health informatics 10.64898/2026.04.25.26351733 medRxiv

Top 0.1%

4.8%

Objective: To develop and evaluate a novel machine learning (ML) framework tailored to a clinical diabetes dataset and to assess whether demographic stratification enhances model performance and interpretability for multiclass diabetes classification. Methods: A clinical dataset of 264 patient records was used to classify individuals into non-diabetic, prediabetic and diabetic categories. Several supervised learning models were trained using an 80:20 train-test split and optimized using RandomizedSearchCV with 10-fold cross-validation. Model performance was evaluated using accuracy, precision, recall and the F1-score. Area under the receiver operating characteristic curve (AUC) was calculated for the best generalizing model. A structured ML framework was developed for this dataset, incorporating preprocessing, model optimization, and stratified analysis by age (<35 vs >35 years) and gender. SHAP was used for model interpretability. Results: Ensemble methods demonstrated superior performance in comparison to linear or single-tree approaches, with Gradient Boosting showing the most stable generalization with a test accuracy of 0.981 and stable cross-validation accuracy of 0.972. AUC-ROC analysis using Gradient Boosting yielded good discriminative ability across the three diabetes classes: 0.991 (non-diabetic), 0.986 (prediabetic) and 0.972 (diabetic). Stratified analysis showed improved reliability in individuals aged >35 years (accuracy = 0.94, F1-score = 0.92), while performance in younger individuals was unstable due to small sample size. SHAP analysis identified HbA1c, BMI, and age as dominant predictors. Conclusion: This study presents an ML framework integrating age-stratified modelling with explainable ML methods to improve interpretability. The findings offer clinically relevant results that can support clinical decision-making systems, individualized risk assessment, and potential applications for targeted intervention in diabetes progression.

3

Dynamic and Baseline Multi-Task Learning for Predicting Substance Use Initiation in the ABCD Study

Wei, M.; Zhang, H.; Peng, Q.

2026-04-13 addiction medicine 10.64898/2026.04.10.26350655 medRxiv

Top 0.1%

4.2%

Background: Early initiation of substance use is linked to later adverse outcomes, and risk factors come from multiple domains and are shared across substances. In our previous work, traditional time-to-event Cox models identified individual risk factors, but these models are not designed to jointly model multiple outcomes or capture complex non-linear relationships. Multi-task learning (MTL) can leverage shared structure across related outcomes to improve prediction and distinguish common versus substance-specific predictors. However, most MTL studies rely on baseline features and focus on single outcomes, which limits their ability to capture shared risk and temporal changes. Substance use initiation is a time-dependent process that unfolds during development and reflects changing exposures over time. Baseline-only models cannot capture these changes or represent risk dynamics. Discrete-time modeling provides a practical approach by estimating interval-level initiation risk and combining it into cumulative risk at the subject level. By integrating multi-task learning with dynamic modeling, it is possible to share information across outcomes while capturing how risk evolves over time, which may improve prediction performance. Methods: Using the Adolescent Brain Cognitive Development (ABCD) Study (release 5.1), we developed two complementary multi-task learning (MTL) frameworks to predict initiation of alcohol, nicotine, cannabis, and any substance use. A baseline MTL model predicted fixed-horizon (48-month) initiation using one record per participant, while a dynamic discrete-time MTL model incorporated longitudinal interval data to model time-varying risk. Both models used multi-domain environmental exposures, core covariates, and polygenic risk scores (PRS). Performance was evaluated on a held-out test set using AUROC, PR-AUC, and calibration metrics, and compared with single-task logistic regression (LR). 
Feature importance was assessed using permutation importance and compared with Cox proportional hazards models. Results: MTL showed comparable or improved performance relative to LR, with larger gains for low-prevalence outcomes (cannabis and nicotine). Incorporating longitudinal information led to consistent improvements across all outcomes. Dynamic models increased AUROC by +0.044 to +0.062 for MTL and +0.050 to +0.084 for LR, indicating that temporal information was the primary driver of performance gains. Feature importance analyses showed modest overlap across methods, with higher agreement between dynamic MTL and Cox models than static MTL. A small set of features, including externalizing behavior, parental monitoring, and developmental factors, were consistently identified across all approaches. Conclusions: Dynamic multi-task learning improves the prediction of substance use initiation by leveraging longitudinal structure and shared information across outcomes. While MTL provides additional gains, incorporating time-varying information is the dominant factor for improving performance. Combining baseline and dynamic frameworks offers a comprehensive strategy for identifying robust risk factors and modeling adolescent substance use initiation.
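
The abstract mentions combining interval-level initiation risk into subject-level cumulative risk. A minimal sketch of the standard discrete-time survival identity is shown below: cumulative risk equals one minus the product of per-interval survival probabilities. This is the textbook formula, assumed here for illustration; the preprint does not state that the authors use exactly this aggregation, and the function name and example hazards are hypothetical.

```python
def cumulative_risk(interval_hazards):
    """Combine per-interval event probabilities into cumulative risk.

    Each hazard is the probability of initiation in that interval,
    conditional on not having initiated before it.
    """
    survival = 1.0
    for h in interval_hazards:
        if not 0.0 <= h <= 1.0:
            raise ValueError("each interval hazard must lie in [0, 1]")
        survival *= 1.0 - h  # probability of remaining event-free
    return 1.0 - survival


# Hypothetical predicted hazards for one participant over four intervals.
hazards = [0.01, 0.02, 0.05, 0.08]
print(f"cumulative risk: {cumulative_risk(hazards):.4f}")
```

Because each term conditions on being event-free so far, the cumulative value never decreases as intervals are added, which makes it usable as a fixed-horizon score comparable to a baseline model's prediction.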

4

Comparison of foundation models and transfer learning strategies for diabetic retinopathy classification

Li, L. Y.; Lebiecka-Johansen, B.; Byberg, S.; Thambawita, V.; Hulman, A.

2026-04-20 health informatics 10.64898/2026.04.17.26351092 medRxiv

Top 0.1%

4.2%

Diabetic retinopathy (DR) is a leading cause of vision impairment, requiring accurate and scalable diagnostic tools. Foundation models are increasingly applied to clinical imaging, but concerns remain about their calibration. We evaluated DINOv3, RETFound, and VisionFM for DR classification using different transfer learning strategies in BRSET (n = 16,266) and mBRSET (n = 5,164). Models achieved high discrimination in binary classification (normal vs retinopathy) in BRSET (AUROC 0.90-0.98), with DINOv3 achieving the best under full fine-tuning (AUROC 0.98 [95% CI: 0.97-0.99]). External validation on mBRSET showed decreased performance for all models regardless of the fine-tuning strategy (AUROC 0.70-0.85), though fine-tuning improved performance. Foundation models achieved strong discrimination but poor calibration, generally overestimating DR risk. While the generalist model, DINOv3, benefited from deeper fine-tuning, miscalibration remained evident. These findings underscore the need to improve calibration and the comprehensive evaluation of foundation models, which are essential in clinical settings. Author summary: Artificial intelligence is increasingly being used to detect eye diseases such as diabetic retinopathy from retinal images. Recent advances have introduced "foundation models," which are trained on large datasets and can be adapted to new tasks. We aimed to evaluate how well these models perform in a clinical prediction context, with a focus not only on accuracy but also on how reliably they estimate disease risk. In this study, we compared different types of foundation models using two independent datasets from Brazil. We found that while these models were generally good at distinguishing between healthy and diseased eyes, their predicted risks were often poorly calibrated. In other words, the estimated probabilities did not consistently reflect the true likelihood of disease. 
We also examined whether adapting the models to the target population could improve performance. Although this approach led to improvements, calibration issues remained. However, post-training correction improved the agreement between predicted risks and observed outcomes. Our findings highlight an important gap between model performance and clinical usefulness. We suggest that improving the reliability of risk estimates is essential before such systems can be safely used in healthcare.

5

Perioperative Mortality Prediction Using a Bayesian Ensemble with Prevalence-Adaptive Gating

Pandey, A. K.

2026-04-06 health informatics 10.64898/2026.04.03.26350114 medRxiv

Top 0.1%

3.6%

Background: Perioperative mortality prediction in resource-limited surgical settings remains challenging due to class imbalance, missing data, and the heterogeneity of postoperative complications. Existing risk scores such as POSSUM depend on intraoperative variables and do not quantify prediction uncertainty. Methods: We developed a prevalence-adaptive Bayesian ensemble comprising three stochastic models: a classifier Variational Autoencoder (VAE, AUC=0.95), a Flipout Last Layer network (AUC=0.84), and a Monte Carlo Dropout network (AUC=0.80), trained on 697 patients (39 deaths, prevalence 5.59%) with 67 preoperative and postoperative features. Class imbalance (16.9:1) was addressed through Variational Autoencoder augmentation: two class-conditional generative VAEs produced 619 synthetic survivor and 619 synthetic death records, yielding a balanced training corpus of 1,935 samples. VAE augmentation was selected over SMOTE and random oversampling after a comparative study (F1: random oversampling 0.61 vs VAE augmentation 0.77). Validation used a held-out set of 233 patients (13 deaths, 220 survivors). A six-stage prediction pipeline incorporated weighted base risk, a three-path prevalence-adaptive gate, Shannon entropy uncertainty quantification, and rank-transform calibration. Sensitivity analysis was conducted across all six empirically derived hyperparameters. A whole-cohort death audit evaluated all 52 deaths from the complete 930-patient dataset through the deployed clinical decision support system. Statistical analysis included Kruskal-Wallis testing of entropy across triage groups, Wilson score confidence intervals for performance metrics, and Spearman rank correlation for LIME-SHAP interpretability concordance. Results: On the validation cohort the ensemble achieved complete separation (sensitivity 100%, specificity 100%, Youden J=1.000; TP=13, FP=0, TN=220, FN=0). 
The whole-cohort death audit identified 36 of 52 deaths (sensitivity 69.2%, 95% CI 55.7%-80.1%; precision 100%, 95% CI 90.4%-100.0%; F1=0.818, bootstrap 95% CI 0.732-0.894). Shannon entropy differed significantly across triage levels (Kruskal-Wallis H(2)=24.212, p<0.001, ε²=0.453), confirming a monotone gradient SAFE < CRITICAL < GRAY ZONE. All six hyperparameters were invariant across their tested ranges (J=1.000 throughout; Supplementary Tables S1-S2). LIME and SHAP rankings showed statistically significant concordance (Spearman ρ=0.440, p=0.024; Kendall τ=0.357, p=0.011), with 4 of 6 principal mortality determinants shared across both methods. Conclusions: A prevalence-adaptive Bayesian ensemble with entropy-based uncertainty triage achieves zero false positive alerts and clinically meaningful audit sensitivity in perioperative mortality prediction. Complete hyperparameter invariance confirms that reported performance reflects structural properties of the calibration architecture. The 16 missed deaths represent feature-invisible cases beyond current observational feature capacity.
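
The abstract reports Wilson score confidence intervals for the audit metrics. The sketch below applies the standard Wilson formula to the counts given above (36 of 52 deaths detected; 36 of 36 alerts correct). The function name and the choice of z = 1.96 are assumptions made for illustration; the preprint does not show its own code.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (default ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Counts taken from the abstract.
sens_lo, sens_hi = wilson_interval(36, 52)   # audit sensitivity
prec_lo, prec_hi = wilson_interval(36, 36)   # precision with zero false positives
print(f"sensitivity 95% CI: {sens_lo:.3f}-{sens_hi:.3f}")
print(f"precision   95% CI: {prec_lo:.3f}-{prec_hi:.3f}")
```

Under these assumptions the computed bounds agree with the intervals quoted in the abstract to the reported precision. Unlike the simple normal approximation, the Wilson interval stays informative when the observed proportion is exactly 100%.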

6

Revisiting Reconstruction Likelihood: Variational Autoencoders for Biological and Biomedical Data Clustering

Korenic, A.; Özkaya, U.; Capar, A.

2026-04-12 bioinformatics 10.64898/2026.04.09.717460 medRxiv

Top 0.2%

2.4%

Background and Objective: Variational Autoencoders (VAEs) offer a powerful framework for unsupervised anomaly detection and data clustering, often surpassing traditional methods. A core strength of VAEs lies in their ability to model data distributions probabilistically, enabling robust identification of anomalies and clusters through reconstruction likelihood -- a stochastic metric providing a principled alternative to deterministic error scores. Methods: We investigated how different VAE architectures, combining reconstruction likelihood with a learnable or data-driven prior, performed in a clustering task on a toy dataset such as MNIST. Results were verified using dimensionality reduction techniques like t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), alongside clustering algorithms such as k-means and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). Results: The VAE encoder inherently maps data points into a latent space exhibiting discernible cluster structure, as evidenced by alignment with ground truth labels. While dimensionality reduction techniques (both t-SNE and UMAP) facilitated the application of clustering algorithms (k-means and HDBSCAN), these methods were primarily used to visualize and interpret the latent space organization. Conclusions: This study demonstrates that VAEs effectively cluster data by implicitly encoding assignments in their latent representations. Determining cluster membership from encoder output, combined with reconstruction likelihood using semantic features, offers a principled approach for identifying typical samples and anomalies. Future research should focus on leveraging this inherent clustering capability of VAEs to enhance interpretability and facilitate clinical application.

7

An independent supervisory safety agent improves reaction of large language models to suicidal ideation

Trivedi, S.; Simons, N. W.; Tyagi, A.; Ramaswamy, A.; Nadkarni, G. N.; Charney, A. W.

2026-04-15 psychiatry and clinical psychology 10.64898/2026.04.13.26350757 medRxiv

Top 0.2%

2.1%

Background: Large language models (LLMs) are increasingly used in mental health contexts, yet their detection of suicidal ideation is inconsistent, raising patient safety concerns. Objective: To evaluate whether an independent safety monitoring system improves detection of suicide risk compared with native LLM safeguards. Methods: We conducted a cross-sectional evaluation using 224 paired suicide-related clinical vignettes presented in a single-turn format under two conditions (with and without structured clinical information). Native LLM safeguard responses were compared with an independent supervisory safety architecture with asynchronous monitoring. The primary outcome was detection of suicide risk requiring intervention. Results: The supervisory system detected suicide risk in 205 of 224 evaluations (91.5%) versus 41 of 224 (18.3%) for native LLM safeguards. Among 168 discordant evaluations, 166 favored the supervisory system and 2 favored the LLM (matched odds ratio ≈ 83.0). Both systems detected risk in 39 evaluations, and neither in 17. Detection was highest in scenarios with explicit suicidal ideation and lower in more ambiguous presentations. Conclusions: Native LLM safeguards frequently failed to detect suicide risk in this structured evaluation. An independent monitoring approach substantially improved detection, supporting the role of external safety systems in high-risk mental health applications of LLMs.
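
The paired counts in this abstract can be checked with a short worked example. The sketch below rebuilds the 2x2 paired table from the four reported cells and derives the two detection rates and the matched odds ratio, which for paired data is the ratio of the two discordant cells. The function and variable names are hypothetical; only the counts come from the abstract.

```python
def paired_summary(both, only_a, only_b, neither):
    """Summarize a paired 2x2 table for two detectors A and B.

    both    : pairs where A and B both detected risk
    only_a  : pairs where only A detected risk (discordant)
    only_b  : pairs where only B detected risk (discordant)
    neither : pairs where neither detected risk
    """
    total = both + only_a + only_b + neither
    return {
        "total": total,
        "rate_a": (both + only_a) / total,
        "rate_b": (both + only_b) / total,
        # The matched odds ratio uses only the discordant pairs.
        "matched_or": only_a / only_b,
    }


# A = supervisory system, B = native LLM safeguards (counts from the abstract).
summary = paired_summary(both=39, only_a=166, only_b=2, neither=17)
print(summary["total"])                      # number of paired evaluations
print(f"{summary['rate_a']:.1%}")            # supervisory detection rate
print(f"{summary['rate_b']:.1%}")            # native safeguard detection rate
print(summary["matched_or"])                 # matched odds ratio
```

The four cells sum to the 224 evaluations, and the derived rates and odds ratio match the figures quoted in the abstract. With only 2 pairs in the smaller discordant cell, the odds ratio estimate is necessarily imprecise, so its magnitude is best read alongside the raw counts.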

8

Enhancing Medical Knowledge in Large Language Models via Supervised Continued Pretraining on Clinical Notes

Weissenbacher, D.; Shabbir, M.; Campbell, I. M.; Berdahl, C. T.; Gonzalez-Hernandez, G.

2026-04-04 health informatics 10.64898/2026.04.02.26350065 medRxiv

Top 0.2%

1.9%

Background: Large language models (LLMs) contain limited professional medical knowledge, as large-scale training on clinical text has not yet been possible due to restricted access. Objectives: To continue pre-training an open-access instruct LLM on de-identified medical notes and evaluate the resulting impact on real-world clinical decision-making tasks and standard benchmarks. Methods: Using 500K de-identified clinical notes from Cedars-Sinai Health System, we fine-tuned a Qwen3-4B Instruct model with supervised learning to generate medical decision-making (MDM) paragraphs from patient presentations, and evaluated it on assigned-diagnosis prediction, in-hospital cardiac-arrest mention detection, and a suite of general and biomedical benchmarks. Results: The fine-tuned model produced MDMs that closely resembled those written by physicians and outperformed the base-instruct model and larger clinically untrained models (Qwen3-32B and Llama-3.1-405B Instruct) on assigned-diagnosis prediction, the task most aligned with its training objective. On the task of detecting in-hospital cardiac arrest mentions, the model initially exhibited mild label collapse, but a brief task-specific fine-tuning stage resolved this issue and allowed it to surpass all competitors. The model also retained general knowledge overall on biomedical and general-domain evaluation benchmarks relative to the baseline. Conclusion: Supervised full fine-tuning on clinical notes allowed the model to incorporate medical knowledge and to transfer it to unseen biomedical tasks without wholesale loss of general-domain abilities, while revealing collapse-related failure modes that motivate more principled strategies for clinical specialization.

9

Developing a Tiered Machine Learning Alert System for Real-Time Suicide Risk Detection in a Digital Mental Health Setting

Donegan, M. L.; Srivastava, A.; Peake, E.; Swirbul, M.; Ungashe, A.; Rodio, M. J.; Tal, N.; Margolin, G.; Benders-Hadi, N.; Padmanabhan, A.

2026-03-30 psychiatry and clinical psychology 10.64898/2026.03.26.26349452 medRxiv

Top 0.3%

1.7%

The goal of this work was to leverage a large corpus of text-based psychotherapy data to create novel machine learning algorithms that can identify suicide risk in asynchronous text therapy. Advances in the field of natural language processing and machine learning have allowed us to include novel data sources as well as use encoding models that can represent context. Our models utilize advanced natural language processing techniques, including fine-tuned transformer models like RoBERTa, to classify risk. Subsequent model versions incorporated non-text data such as demographic features and census-derived social determinants of health to improve equitable and culturally responsive risk assessment, as well as multiclass models that can identify tiered levels of risk. All new models demonstrated significant improvements over our previous model. Our final version, a multiclass model, provides a tiered system that classifies risk as "no risk," "moderate," or "severe" (weighted F1 of 0.85). This tiered approach enhances clinical utility by allowing providers to quickly prioritize the most urgent cases, supporting more accurate and timely intervention for clients in need.

10

A Machine Learning Based Causal Interface for Time-Varying Environmental Predictors of Substance Use Initiation in the ABCD Study

Wei, M.; Yadlapati, L.; Peng, Q.

2026-04-17 addiction medicine 10.64898/2026.04.15.26350988 medRxiv

Top 0.3%

1.7%

Background: The Adolescent Brain Cognitive Development (ABCD) Study provides rich longitudinal data on environmental, genetic, and behavioral factors related to substance use initiation. Classical marginal structural models (MSMs) require selecting covariates for propensity models, which is challenging when there are many correlated predictors. Methods: We analyzed longitudinal panel data from 11,868 ABCD participants with repeated observations over time. Interval-level binary outcomes were defined for initiation of alcohol, nicotine, cannabis, and any substance, including only participants at risk before initiation. All predictors were constructed as lagged variables to preserve temporal ordering. We used a two-stage machine learning-based causal framework. First, we performed graph discovery using a Granger-inspired lagged predictive modeling approach with elastic-net logistic regression to identify relationships between past predictors and future outcomes. Stable candidate edges were selected using subject-level bootstrap stability selection. Second, we estimated adjusted effects for stable predictors using double machine learning (DML) with partialling-out and cross-fitting. For each predictor, the lagged variable was treated as the exposure and adjusted for high-dimensional lagged covariates. Cross-fitting with group-based splitting accounted for within-subject dependence. Nuisance functions were estimated using random forests, and cluster-robust standard errors were used for inference. Results: We identified stable predictors across multiple domains, including sleep patterns, family environment, peer relationships, behavioral traits, and genetic risk. Many predictors were shared across substance outcomes, while some were outcome-specific. Effect sizes were modest, typically ranging from -0.01 to 0.02 per standard deviation increase in the predictor. Both risk-increasing and protective associations were observed. 
Risk factors included sleep disturbance and behavioral risk indicators, while protective factors included parental monitoring and structured environments. Conclusions: This study presents a practical framework for analyzing high-dimensional longitudinal data and identifying time-varying predictors of substance use initiation. The approach combines machine learning for variable selection with causal inference for effect estimation. The results highlight both shared and outcome-specific risk factors and identify modifiable targets, such as family environment and sleep, that may inform prevention strategies.

11

A Deployable Explainable Deep Learning System for Tuberculosis Detection from Chest X-Rays in Resource-Constrained High-Burden Settings

Agumba, J.; Erick, S.; Pembere, A.; Nyongesa, J.

2026-04-01 radiology and imaging 10.64898/2026.03.31.26349662 medRxiv

Top 0.3%

1.5%

Objectives: To develop and evaluate a deployable deep learning system with Gradient-weighted Class Activation Mapping (Grad-CAM) for tuberculosis screening from chest radiographs and to assess its classification performance and explainability across desktop and mobile deployment platforms. Materials and methods: This study used publicly available chest X-ray datasets containing Normal and Tuberculosis images. A DenseNet121-based transfer learning model was trained using stratified training, validation, and test splits with data augmentation and class weighting. Model performance was evaluated using accuracy, precision, recall, F1 score, receiver operating characteristic (ROC) curve, and area under the ROC curve (AUC). Grad-CAM was used to visualize regions influencing model predictions. The trained model was converted to TensorFlow Lite and deployed in both a Windows desktop application and a Flutter-based mobile application for offline inference and visualization. Results: The model demonstrated strong classification performance on the independent test dataset, with high accuracy and AUC values indicating effective discrimination between Normal and Tuberculosis cases. Grad-CAM visualizations showed that the model focused primarily on anatomically relevant lung regions, particularly the upper and mid-lung fields in Tuberculosis cases. Deployment testing confirmed consistent prediction outputs and Grad-CAM visualizations across both Windows and mobile platforms. Conclusion: The proposed deployable deep learning system with Grad-CAM provides accurate and interpretable tuberculosis screening from chest radiographs and demonstrates feasibility for offline mobile and desktop deployment. This approach has potential as an artificial intelligence-assisted screening and decision support tool in radiology, particularly in resource-limited and remote healthcare settings.

12

Fourier Analysis of Bilateral Breast Asymmetry for Short-term Breast Cancer Risk Prediction

Heine, J.; Fowler, E.; Egan, K.; Weinfurtner, R. J.; Balagurunathan, Y.; Schabath, M. B.

2026-03-30 radiology and imaging 10.64898/2026.03.27.26349508 medRxiv

Top 0.3%

1.4%

A substantial body of evidence demonstrates that measures from mammograms are predictive of breast cancer risk. In this matched case-control study, mammograms acquired near the time of diagnosis were analyzed to investigate bilateral breast asymmetry as a measure for short-term risk prediction. Specifically, contralateral breast images were compared with measures derived in the Fourier domain (FD); this technique summarizes power in concentric radial bands that cover the Fourier plane. Equivalently, this approach can be described as a multiscale characterization of the image. The summarized power difference between respective contralateral bands produces an asymmetry measure. Full field digital mammography (FFDM) and synthetic two-dimensional images from digital breast tomosynthesis (DBT) were investigated for women who had both types of mammograms acquired at the same time. Odds ratios (ORs) and the area under the receiver operating characteristic curves (Azs) were generated from conditional logistic regression modeling with 95% confidence intervals. Raw unprocessed FFDM images produced significant findings: OR = 1.90 (1.58, 2.29) and Az = 0.72 (0.67, 0.76) per one standard deviation unit. Associations were significant but attenuated for both clinical FFDM and DBT images: OR = 1.31 (1.11, 1.54) and Az = 0.63 (0.58, 0.67); and OR = 1.48 (1.25, 1.76) and Az = 0.65 (0.60, 0.70), respectively. Results suggest that clinical FFDM and DBT images are inferior to raw FFDM images in capturing breast asymmetry, with information loss for breast cancer risk prediction. Moreover, these DBT images have lower spatial resolution but produced stronger associations than the clinical FFDM images.

13

Predicting long term clinical outcomes in Parkinson's Disease using short term rating scales

Burnell, M.; Gonzalez-Robles, C.; Zeissler, M.-L.; Bartlett, M.; Clarke, C. S.; Counsell, C.; Hu, M. T.; Foltynie, T.; Carroll, C.; Lawton, M.; Ben-Shlomo, Y.; Carpenter, J.

2026-03-30 neurology 10.64898/2026.03.27.26349548 medRxiv

Top 0.4%

1.3%

Background: Most trials of Parkinson's disease (PD) measure progression over a short to medium time-period using continuous rating scales that may be hard to interpret and less meaningful for patients. There is a lack of evidence connecting changes in these scales to changes in outcomes important to patients. Objectives: We present causal modelling to translate the causal, short-term disease-modifying treatment effects on functional rating scales to the 10-year risk of serious clinical progression milestones. Methods: We selected four important clinical milestones of disease progression from the Oxford Parkinson's Disease Centre "Discovery" cohort: dementia, any falls, frequent falls, and mortality. We proposed a causal framework for our research objectives so we could model the potential impact of a 30% reduction in disease progression slopes ("treatment effect") using the summation of parts I and II of the Movement Disorders Society Unified Parkinson's Disease Rating Scale (UPDRS12). This outcome was regressed on time to milestone using flexible parametric survival models. Marginal predictions of survival and survival difference at year 10 were then calculated for the Discovery cohort, and a counterfactual cohort applying the treatment effect to estimate the relative and absolute reductions for the four clinical milestones. Results: The modelled increases in risk for each unit change in the UPDRS12 were as follows: dementia hazard ratio (HR)=1.52 (95% Confidence Interval (CI) 1.36-1.70), any falls HR=1.37 (95% CI 1.29-1.46), frequent falls HR=1.68 (95% CI 1.49-1.89), mortality HR=1.29 (95% CI 1.17-1.42). These models led to marginal predictions of absolute reductions, when the progression was reduced by 30%, between 4.0% (mortality) and 7.5% (frequent falls) at 10 years follow up. 
Conclusions: We have demonstrated how a treatment effect in a trial specified in terms of a progression change of a rating scale can be contextualised into a long-term reduction in the probability of clinically relevant milestones. Whilst we have used PD as our exemplar, we believe this methodological approach is generalisable to other chronic progressive diseases where trials are often limited to a relatively short follow-up period and use some scalar measure of progression, but significant clinical milestones usually take longer to be observed. Keywords: Clinical trials; disease modifying therapies; causal estimation; prediction models

14

Do Amyloid Trajectories Reach a Physiologic Ceiling? Evidence from Iterative Approximation and Simulation

Gantenberg, J. R.; La Joie, R.; Heston, M. B.; Ackley, S. F.

2026-04-21 epidemiology 10.64898/2026.04.14.26350359 medRxiv

Top 0.4%

1.2%

Qualitative models of Alzheimer's pathology often posit that amyloid accumulation follows a sigmoid curve, indicating that the rate of deposition wanes over time. Longitudinal PET data now allow us to investigate amyloid accumulation trajectories with greater detail and over longer follow-up periods. We combine inferences from simulated amyloid trajectories, empirical PET data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), and the sampled iterative local approximation algorithm (SILA) to assess whether amyloid accumulation reaches a physiologic ceiling. We find that SILA reliably detects a ceiling, when present, across a range of simulated scenarios that impose a sigmoid shape. When fit to empirical data from ADNI, however, SILA does not appear to indicate the presence of a ceiling. Thus, we conclude that amyloid trajectories may not reach a physiologic ceiling during the stages of Alzheimer's disease typically observed while patients remain under follow-up in cohort studies. Fits using SILA indicate that illustrative models of biomarker cascades, while useful tools for conceptualizing and interrogating pathologic processes, may not represent the shapes of amyloid trajectories accurately. Summary for General Public: Amyloid, a protein implicated in Alzheimer's disease, is thought to reach a plateau in the brain, but methods that estimate how amyloid changes over time suggest it grows unabated. Gantenberg et al. use one such method and simulations to argue that amyloid does not reach a plateau during the typical course of Alzheimer's.

15

Exploring the Relationship Between Non-Suicidal Self-Injury and Problematic Sexual Behaviour

Jiang, S.; Foo, J. C.; Roper, L.; Yang, E.; Green, B.; Arnau, R.; Behavioral Addictions Studies and Insights Consortium; Lodhi, R. J.; Isenberg, R.; Wishart, D. S.; Fujiwara, E.; Carnes, P. J.; Aitchison, K. J.

2026-04-25 addiction medicine 10.64898/2026.04.17.26351044 medRxiv

Top 0.4%

1.2%


Objectives: Non-suicidal self-injury (NSSI) and self-harming sexual behaviours share functional and behavioural overlaps. However, the relationship between NSSI and problematic sexual behaviour (PSB) remains underexplored. This study aimed to investigate the association between NSSI and PSB in two cohorts: a non-clinical university cohort and a clinical PSB patient cohort. Methods: Data were collected from 2,189 university participants and 477 clinical PSB patients. NSSI was assessed via self-report, and PSB was measured with the Sexual Addiction Screening Test-Revised (SAST-R) Core. The four core addictive dimensions of PSB (relationship disturbance, loss of control, preoccupation, and affect disturbance) were also evaluated. Logistic regression analyses were conducted to examine the association between PSB (presence/absence and severity) and NSSI, looking at effects of gender and contributions of addictive dimensions of PSB. Results: Rates of NSSI were similar in the university (7.1%) and patient (5.7%) cohorts; stratified by gender, a higher proportion of women had NSSI in the PSB patient cohort than in the university cohort (29.3% vs 9.3%). In the university cohort, which had milder PSB than the patient cohort, PSB was associated with NSSI (OR=2.11, p<0.001); a significant gender-by-PSB interaction showed that women with PSB were over four times more likely to have NSSI than men without PSB (OR=4.44, p=0.037). In contrast, PSB severity was not associated with NSSI in PSB patients (OR=1.10, p=0.25). Associations of the addictive dimensions of PSB with NSSI were observed only in the subgroup of university women, in the 'preoccupation' dimension (p<0.001). Conclusions: Our findings highlight gender-specific patterns in the association between PSB and NSSI, suggesting the need for further research and possibly targeted prevention and intervention strategies in women.

16

BSO-AD: An Ontology for Representing and Harmonizing Behavioral Social Knowledge in ADRD

Li, H.; Yu, Y.; Bhandarkar, A.; Kumar, R.; Clark, I. H.; Hu, Y.; Cao, W.; Zhao, N.; LI, F.; Tao, C.

2026-03-31 health informatics 10.64898/2026.03.30.26349756 medRxiv

Top 0.5%

1.2%


Objective: Behavioral and social factors (BSFs) substantially influence the risk, onset, and progression of Alzheimer disease and related dementias (ADRD). A systematic representation of their interplay is essential for advancing prevention and targeted interventions. However, BSF-related knowledge is scattered across heterogeneous sources, limiting scalable evidence synthesis and computational analysis. To address this, we created a Behavioral Social Data and Knowledge Ontology for ADRD (BSO-AD) to represent and integrate BSFs with respect to ADRD. Material and Methods: BSO-AD was developed following established ontology design principles, prioritizing reuse of existing ontology elements to ensure semantic interoperability. It was built upon the Social Determinants of Health Ontology (SDoHO) and the Drug-Repurposing Oriented Alzheimer Disease Ontology (DROADO). BSF-related classes were enriched with ICD-10-CM Z55-Z65 codes and ADRD-related classes with AD Onto. Relationships between BSFs and ADRD were derived through literature mining. Ontology quality was evaluated through Hootation-based expert review and an LLM-assisted framework assessing structural coverage and semantic coherence. Results: BSO-AD contains 2275 classes, 153 object properties, and 49 data properties. Expert review demonstrated strong rational agreement (0.95), with disagreements resolved through discussion. LLM-based evaluation showed high category coverage rates (≥ 0.97) and robust semantic alignment with the relevant literature (average completeness = 0.79; conciseness = 0.94). Discussion and Conclusion: BSO-AD is, to our knowledge, the first ontology to systematically represent BSFs and hierarchically model their interrelationships in ADRD. It establishes a semantic backbone for computational analysis and knowledge integration. The LLM-assisted evaluation framework demonstrates the feasibility of scalable, automated ontology assessment.

17

The Visual Hemofilter: a novel visualization technology that improves task performance among intensive care professionals: A prospective simulation study.

Bider-Lunkiewicz, J.; Gasciauskaite, G.; Rück Perez, B.; Braun, J.; Willms, J.; Szekessy, H.; Nöthiger, C.; Hoffmann, M.; Milovanovic, P.; Keller, E.; Tscholl, D. W.

2026-04-20 intensive care and critical care medicine 10.64898/2026.04.16.26351012 medRxiv

Top 0.5%

1.1%


Purpose: This study evaluates the Visual Hemofilter, a novel decision-support and information transfer tool designed to assist with regional citrate anticoagulation (RCA) in hemofiltration. By representing hemofilter parameters and patient blood constituents as animated icons, the tool aims to improve clinicians' interpretation of blood gas results and RCA reference tables. We hypothesized that the Visual Hemofilter would enhance clinical decision-making by enabling faster and more accurate therapy adjustments, increasing clinicians' confidence in their decisions, and reducing cognitive workload compared to conventional methods. Methods: We conducted a prospective, randomized, computer-based simulation study across four intensive care units at the University Hospital Zurich. Twenty-six critical care professionals participated, each managing regional citrate anticoagulation (RCA) scenarios using either the Visual Hemofilter or conventional methods involving blood gas analysis and reference tables. Following each scenario, participants made therapy adjustments and rated their decision confidence and cognitive workload. Results: Use of the Visual Hemofilter significantly improved decision accuracy (odds ratio [OR] 3.96; 95% CI 2.03-7.73; p < 0.0001) and reduced decision time by an average of 33 seconds (mean difference -33.3 seconds; 95% CI -39.4 to -27.2; p < 0.0001). Participants also reported greater confidence in their decisions (OR 5.41; 95% CI 2.49-11.77; p < 0.0001) and experienced lower cognitive workload (mean difference -15.05 points on the NASA-TLX scale (National Aeronautics and Space Administration-Task Load Index); 95% CI -18.99 to -11.13; p < 0.0001). Conclusions: The Visual Hemofilter enhances clinical decision-making in RCA by increasing accuracy and speed, boosting decision confidence, and reducing cognitive workload. This technology has the potential to reduce errors and better support critical care professionals in managing complex treatment scenarios.

18

Multi-task deep learning integrating pretreatment MRI and whole slide images predicts induction chemotherapy response and survival in locally advanced nasopharyngeal carcinoma

Hou, J.; Yi, X.; Li, C.; Li, J.; Cao, H.; Lu, Q.; Yu, X.

2026-04-11 radiology and imaging 10.64898/2026.04.07.26350350 medRxiv

Top 0.5%

1.1%


Predicting response to induction chemotherapy (IC) and overall survival (OS) is critical for optimizing treatment in patients with locally advanced nasopharyngeal carcinoma (LANPC). This study aimed to develop and validate a multi-task deep learning model integrating pretreatment MRI and whole slide images (WSIs) to predict IC response and OS in LANPC. Pretreatment MRI and WSIs from 404 patients with LANPC were retrospectively collected to construct a multi-task model (MoEMIL) for the simultaneous prediction of early IC response and OS. MoEMIL employed multi-instance learning to process WSIs, PyRadiomics and a convolutional neural network (ResNet50) to extract MRI features, and fused multimodal features through a multi-gate mixture-of-experts architecture. Clustering-constrained attention multiple instance learning and gradient-weighted class activation mapping were applied for visualization and interpretation. MoEMIL effectively stratified patients into good and poor IC response groups, achieving areas under the curve of 0.917, 0.869, and 0.801 in the train, validation, and test sets, respectively, and outperformed the deep learning radiomics model, the pathomics model and TNM staging. The model also stratified patients into high- and low-risk OS groups (P < 0.05). MoEMIL shows promise as a decision-support tool for early IC response prediction and prognostication in LANPC. Author Summary: We have developed a deep learning model that integrates two types of medical images, including magnetic resonance imaging (MRI) and digital pathological slices, to simultaneously predict response to induction chemotherapy and prognosis in patients with locally advanced nasopharyngeal carcinoma. Current treatment decisions primarily rely on traditional tumor staging (TNM), which often fails to comprehensively reflect the complexity of the disease. Our model, named MoEMIL, was trained and tested on data from 404 patients across two hospitals and consistently outperformed both single-model approaches and TNM staging methods. By identifying patients who exhibit poor response to induction chemotherapy or higher prognostic risk, our tool can assist clinicians in achieving personalized treatment, enabling intensified management for high-risk patients and avoiding unnecessary side effects for low-risk patients. Additionally, we visualize the model's reasoning process through heat map generation, which highlights the image regions exerting the greatest influence on prediction outcomes. This work represents a step toward more precise treatment for nasopharyngeal carcinoma; however, larger-scale prospective studies are required before the model can be integrated into routine clinical practice.

19

Attitudes and Perceptions Toward the Use of Artificial Intelligence Chatbots for Peer Review in Medical Journals: A Large-Scale, International Cross-Sectional Survey

Ng, J. Y.; Bhavsar, D.; Dhanvanthry, N.; Bouter, L.; Chan, T.; Cramer, H.; Flanagin, A.; Iorio, A.; Lokker, C.; Maisonneuve, H.; Marusic, A.; Moher, D.

2026-04-07 health informatics 10.64898/2026.04.07.26350263 medRxiv

Top 0.5%

1.0%


Background: Artificial intelligence chatbots (AICs), as a form of generative artificial intelligence (AI), are increasingly being considered for use in scholarly peer review to assist with tasks such as identifying methodological issues, verifying references, and improving language clarity. Despite these potential benefits, concerns remain regarding their reliability, ethical implications, and transparency. Evidence on how medical journal peer reviewers perceive the role and impact of AICs is limited. This study explored reviewers' familiarity with AICs, perceived benefits and challenges, ethical concerns, and anticipated future roles in peer review. Methods: We conducted a cross-sectional online survey of medical journal peer reviewers. Corresponding author information was extracted from MEDLINE-indexed articles added to PubMed within a two-month period using an R-based approach. A total of 72,851 authors were invited via email to participate; those who self-identified as peer reviewers were eligible. The 29-item survey assessed familiarity with AICs and perceptions of their benefits and limitations in peer review. The survey was administered via SurveyMonkey from April 28 to June 16, 2025, with two reminder emails sent during the data collection period. Results: A total of 1,260 respondents completed the survey. Most participants were familiar with AICs (86.2%) and had used tools such as ChatGPT for general purposes (87.7%), but the majority had not used AICs for peer review (70.3%). Most respondents reported that their institutions do not provide training on AIC use in peer review (69.5%), although many expressed interest in such training (60.7%). Perceptions of AIC benefits were mixed, while concerns were widely shared, particularly regarding potential algorithmic bias (80.3%) and issues related to trust and user acceptance (73.3%). Conclusions: While familiarity with AICs is high among medical journal peer reviewers, their use in peer review remains limited. There is clear interest in training and guidance; however, concerns related to ethics, data privacy, and research integrity persist and should be addressed before broader implementation.

20

Attitudes and Perceptions of Generative Artificial Intelligence Chatbots in the Scientific Process of Traditional, Complementary, and Integrative Medicine Research: A Large-Scale, International Cross-Sectional Survey

Ng, J. Y.; Tan, J.; Syed, N.; Adapa, K.; Gupta, P. K.; Li, S.; Mehta, D.; Ring, M.; Shridhar, M.; Souza, J. P.; Yoshino, T.; Lee, M. S.; Cramer, H.

2026-04-15 health informatics 10.64898/2026.04.13.26350612 medRxiv

Top 0.5%

0.9%


Background: Generative artificial intelligence (GenAI) chatbots have shown utility in assisting with various research tasks. Traditional, complementary, and integrative medicine (TCIM) is a patient-centric approach that emphasizes holistic well-being. The integration of TCIM and GenAI presents numerous key opportunities. However, TCIM researchers' attitudes toward GenAI tools remain less understood. This large-scale, international cross-sectional survey aimed to elucidate the attitudes and perceptions of TCIM researchers regarding the use of GenAI chatbots in the scientific process. Methods: A search strategy in Ovid MEDLINE identified corresponding authors who were TCIM researchers. Eligible authors were invited to complete an anonymous online survey administered via SurveyMonkey. The survey included questions on socio-demographic characteristics, familiarity with GenAI chatbots, and perceived benefits and challenges of using GenAI chatbots. Results were analysed using descriptive statistics and thematic content analysis. Results: The survey received 716 responses. Most respondents reported familiarity with GenAI chatbots (58.08%) and viewed them as very important to the future of scientific research (54.37%). The most acknowledged benefits included workload reduction (74.07%) and increased efficiency in data analysis/experimentation (71.14%). The most frequently reported challenges involved bias, errors, and limitations. More than half of the respondents (57.02%) expressed a need for training to use GenAI chatbots in the scientific process, alongside an interest in receiving training (72.07%). However, 43.67% indicated that their institutions did not offer these programs. Discussion: A deeper understanding of TCIM researchers' perspectives can better inform future AI applications in this field and guide future policies and collaboration among researchers.